Tool Call Accuracy V2 #41740
Conversation
Thank you for your contribution @salma-elshafey! We will review the pull request and get back to you soon.
Pull Request Overview
This PR updates the Tool Call Accuracy Evaluator to use a scoring rubric ranging from 1 to 5 instead of a binary score and evaluates all tool calls in a single turn collectively. Key changes include:
- Transition from a binary scoring system (0/1) to a detailed 1–5 rubric.
- Consolidation of tool call evaluations per turn with enhanced output details.
- Updates to test cases, sample notebooks, and documentation to align with the new evaluation logic.
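For context, the following is a minimal usage sketch of the updated evaluator, assuming its documented inputs (query, tool_calls, tool_definitions). The endpoint, deployment, and key values are placeholders, and the exact shape of the V2 output is not confirmed here.

```python
# Hedged sketch: illustrates calling ToolCallAccuracyEvaluator with one tool
# call in a turn; configuration values are placeholders, not real credentials.
from azure.ai.evaluation import ToolCallAccuracyEvaluator

model_config = {
    "azure_endpoint": "https://<your-endpoint>.openai.azure.com",  # placeholder
    "azure_deployment": "<your-deployment>",                       # placeholder
    "api_key": "<your-api-key>",                                   # placeholder
}

evaluator = ToolCallAccuracyEvaluator(model_config=model_config)

result = evaluator(
    query="What is the weather in Seattle?",
    tool_calls=[
        {
            "type": "tool_call",
            "tool_call_id": "call_1",
            "name": "fetch_weather",
            "arguments": {"location": "Seattle"},
        }
    ],
    tool_definitions=[
        {
            "name": "fetch_weather",
            "description": "Fetches the weather for a given location.",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string", "description": "City name."}
                },
            },
        }
    ],
)

# Under V2, all tool calls in the turn are scored collectively on a 1-5
# rubric (previously a per-call 0/1 judgment), with extra output details.
print(result)
```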
Reviewed Changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 1 comment.
File | Description |
---|---|
sdk/evaluation/azure-ai-evaluation/tests/unittests/test_tool_call_accuracy_evaluator.py | Updated unit tests to verify new scoring and output details. |
sdk/evaluation/azure-ai-evaluation/tests/unittests/test_agent_evaluators.py | Modified tests for missing input cases and tool definition validations. |
sdk/evaluation/azure-ai-evaluation/samples/agent_evaluators/tool_call_accuracy.ipynb | Revised sample to demonstrate updated evaluator usage and scoring. |
sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluators/_tool_call_accuracy/_tool_call_accuracy.py | Core evaluator logic modified to support the new scoring rubric and input handling. |
sdk/evaluation/azure-ai-evaluation/CHANGELOG.md | Changelog updated to reflect improvements to the evaluator. |
Comments suppressed due to low confidence (1)
sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluators/_tool_call_accuracy/_tool_call_accuracy.py:150
- The current logic overrides a provided 'tool_calls' parameter with those parsed from 'response' when present, which may not align with the documented behavior; consider preserving the explicitly provided 'tool_calls' when both are supplied.
tool_calls = parsed_tool_calls
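A minimal sketch of the behavior this comment suggests, using hypothetical names (resolve_tool_calls, parsed_tool_calls) rather than the evaluator's actual internals:

```python
from typing import Any, List, Optional

def resolve_tool_calls(
    tool_calls: Optional[List[Any]],
    parsed_tool_calls: List[Any],
) -> List[Any]:
    """Prefer tool calls the caller supplied explicitly; fall back to
    those parsed from 'response' only when none were provided."""
    if tool_calls:
        return tool_calls
    return parsed_tool_calls
```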
Head branch was pushed to by a user without write access (force-pushed from c983593 to 3d4f2cc).
Description
This PR introduces a new version of the Tool Call Accuracy Evaluator with lower intra- and inter-model variance than V1. It introduces a 1-to-5 scoring rubric and collective, per-turn evaluation of all tool calls (see the key changes above). With V2, we achieved an 11% improvement in human-alignment scores compared to V1, as shown in the table below:
